Armance Larfeuille, Katia Voltz, Manoël Pidoux, Elodie Shoeiline Kwan and Nina Bidet
Teacher: O. Boldi, Spring Semester 2022
We are writing this report as part of our “Machine Learning” course, in which we have to produce a detailed report based on an original database. To do this, we will re-analyze an existing data set by adding new elements (new models, features, outcome labels, approaches, interpretations, etc.). Our goal is to classify mountains according to different features, using both supervised and unsupervised learning methods.
After several searches, we found a very interesting article explaining how climate, soil and biotic factors interact to shape the different facets of Alpine plant diversity. The scientists collected these data on three mountains in different locations and with contrasting characteristics.
The database provided with the article was detailed, interesting and included enough observations to conduct our project. An analysis has already been carried out on this data set, and we will complete it using the skills acquired during our semester. Indeed, we will add machine learning methods for classification and regression.
In the referenced article, different analyses were made of the facets of alpine plant diversity: functional (FD), phylogenetic (PD) and taxonomic diversity (TD), on three mountains with contrasting evolutionary histories and climatic conditions. This means the database contains both the plant species and the mountain characteristics. Their goal is to predict the different diversity facets (response variables Y) from the mountain features: elevation, potential solar radiation and soil total nitrogen. The methods used are cluster analysis and an SEM model.
Our project does not take the plants into account. The goal is to study the mountains based only on the soil characteristics.
Our goal is to use the relevant machine learning tools, with supervised and unsupervised learning methods, to characterize our data frame. In our case, the question is whether we can determine which mountain a sample comes from given its soil characteristics.
In this first part, we will load the data into R. Then, we will apply the necessary transformations so that the data can be correctly exploited in the following analyses.
Below are the first six observations of the dataset.
| Sample_ID | Country | Muntain Range | Locality | Plot | Plot_ID | Subplot | Date | Day | Month | Year |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | Spain | Sierra de Guadarrama | Asomate-Hoyos | 1 | 1 | 1 | 2010-07-21 | 21 | 7 | 2010 |
| 2 | Spain | Sierra de Guadarrama | Asomate-Hoyos | 1 | 1 | 2 | 2010-07-21 | 21 | 7 | 2010 |
| 3 | Spain | Sierra de Guadarrama | Asomate-Hoyos | 1 | 1 | 3 | 2010-07-21 | 21 | 7 | 2010 |
| 4 | Spain | Sierra de Guadarrama | Asomate-Hoyos | 1 | 1 | 4 | 2010-07-21 | 21 | 7 | 2010 |
| 5 | Spain | Sierra de Guadarrama | Asomate-Hoyos | 1 | 1 | 5 | 2010-07-21 | 21 | 7 | 2010 |
| 6 | Spain | Sierra de Guadarrama | Bola del Mundo | 2 | 2 | 1 | 2010-06-22 | 22 | 6 | 2010 |
| Grid zone | UTM_X | UTM_Y | Elevation | Orientation | Slope | Radiation | Phos_P | Glu_P | SOC_P | NT_P |
|---|---|---|---|---|---|---|---|---|---|---|
| 30 T | 425862.7 | 4517724 | 2225 | 175 | 4 | 0.8088464 | 4.437515 | 2.505851 | 6.315554 | 4.070239 |
| 30 T | 425862.7 | 4517724 | 2225 | 175 | 4 | 0.8088464 | 4.437515 | 2.505851 | 6.315554 | 4.070239 |
| 30 T | 425862.7 | 4517724 | 2225 | 175 | 4 | 0.8088464 | 4.437515 | 2.505851 | 6.315554 | 4.070239 |
| 30 T | 425862.7 | 4517724 | 2225 | 175 | 4 | 0.8088464 | 4.437515 | 2.505851 | 6.315554 | 4.070239 |
| 30 T | 425862.7 | 4517724 | 2225 | 175 | 4 | 0.8088464 | 4.437515 | 2.505851 | 6.315554 | 4.070239 |
| 30 T | 417601.1 | 4515745 | 2242 | 182 | 2 | 0.7879971 | 5.171000 | 3.234583 | 5.092630 | 4.546233 |
| PT_P | K_P | pH_P | Cond_P | Phos_B | Glu_B | SOC_B | NT_B | PT_B | K_B |
|---|---|---|---|---|---|---|---|---|---|
| 0.456501 | 0.0086492 | 4.925 | 31.284 | 2.888396 | 1.691185 | 3.762122 | 3.297689 | 0.4450694 | 0.0025165 |
| 0.456501 | 0.0086492 | 4.925 | 31.284 | 2.888396 | 1.691185 | 3.762122 | 3.297689 | 0.4450694 | 0.0025165 |
| 0.456501 | 0.0086492 | 4.925 | 31.284 | 2.888396 | 1.691185 | 3.762122 | 3.297689 | 0.4450694 | 0.0025165 |
| 0.456501 | 0.0086492 | 4.925 | 31.284 | 2.888396 | 1.691185 | 3.762122 | 3.297689 | 0.4450694 | 0.0025165 |
| 0.456501 | 0.0086492 | 4.925 | 31.284 | 2.888396 | 1.691185 | 3.762122 | 3.297689 | 0.4450694 | 0.0025165 |
| 3.923043 | 0.0133951 | 5.232 | 48.020 | 3.102327 | 1.955476 | 2.993161 | 3.168254 | 3.1682544 | 0.0067595 |
| pH_B | Cond_B | Phos_T | Glu_T | SOC_T | NT_T | PT_T | K_T | pH_T | Cond_T |
|---|---|---|---|---|---|---|---|---|---|
| 5.402 | 22.198 | 3.593446 | 2.203418 | 5.736881 | 3.958871 | 0.4871885 | 0.0059856 | 5.214310 | 27.60922 |
| 5.402 | 22.198 | 3.593446 | 2.203418 | 5.736881 | 3.958871 | 0.4871885 | 0.0059856 | 5.214310 | 27.60922 |
| 5.402 | 22.198 | 3.593446 | 2.203418 | 5.736881 | 3.958871 | 0.4871885 | 0.0059856 | 5.214310 | 27.60922 |
| 5.402 | 22.198 | 3.593446 | 2.203418 | 5.736881 | 3.958871 | 0.4871885 | 0.0059856 | 5.214310 | 27.60922 |
| 5.402 | 22.198 | 3.593446 | 2.203418 | 5.736881 | 3.958871 | 0.4871885 | 0.0059856 | 5.214310 | 27.60922 |
| 5.760 | 28.700 | 5.607705 | 3.481617 | 5.288592 | 4.912220 | 4.8853870 | 0.0153168 | 5.505655 | 48.93783 |
We delete the variables that are not useful for us. Variables such as the UTM grid zone, UTM_X and UTM_Y merely locate the samples, and we do not want to make predictions based on those characteristics but rather on the actual soil composition. The variables Elevation, Slope, Plot_ID and Orientation will not help us either, because they relate to the place where the sample was taken; all samples were collected a certain height above the tree line. We rename the variable Muntain Range as Mountain_range to facilitate future use.
After these changes, the database Mountain_data is composed of 430 observations and 34 variables.
Now that we have the variables we want to work with, we see that they are all treated as numeric. However, we know that some of them should be categorical, since they take a limited number of levels. We therefore transform Country, Mountain_range, Locality, Plot and Subplot into factors.
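The conversion can be sketched as follows (a minimal example with a toy data frame standing in for Mountain_data):

```r
# Toy stand-in for Mountain_data: categorical columns arrive as
# character/numeric and are converted to factors in one step.
toy <- data.frame(
  Country  = c("Spain", "Spain", "Chile"),
  Locality = c("Asomate-Hoyos", "Asomate-Hoyos", "Banos de Colinas"),
  Plot     = c(1, 1, 76),
  Subplot  = c(1, 2, 2),
  pH_P     = c(4.925, 4.925, 5.1)
)
cat_vars <- c("Country", "Locality", "Plot", "Subplot")
toy[cat_vars] <- lapply(toy[cat_vars], factor)  # convert all at once
str(toy)  # the four columns are now factors; pH_P stays numeric
```

The same `lapply` call applies directly to the real data frame once the column names are adjusted.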
We can export this cleaned dataset to the data folder of our project.
write.csv(Mountain_data,"../data/Mountain_data_cleaned.csv", row.names = FALSE)
For a better representation of the data, we created 3 maps, each icon representing a subsample belonging to a larger sample. Each map represents a mountain region: two in Spain and one in Chile.
This section is dedicated to understanding the data. We will provide an analysis of the data set using a visual approach in order to summarize its main characteristics.
We will analyze some basic statistical elements for each variable. To do this we need to transform the variable Date to date format.
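A minimal sketch of this transformation, assuming the dates are stored as "YYYY-MM-DD" strings as in the sample shown above:

```r
# Convert the character Date column to R's Date class so that
# range(), difftime() etc. work as expected.
d <- data.frame(Date = c("2010-07-21", "2010-06-22"))
d$Date <- as.Date(d$Date, format = "%Y-%m-%d")
class(d$Date)  # "Date"
range(d$Date)  # earliest and latest sampling dates
```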
Below, we look at the general aspect of the data set and try to
discover if there are any missing values.
| rows | columns | all_missing_columns | total_missing_values | complete_rows | total_observations |
|---|---|---|---|---|---|
| 430 | 34 | 0 | 2 | 428 | 14620 |
There are 2 missing values in the feature
Glu_P. One is at the instance number 377 and the other
one is at the instance 378.
| Country | Mountain_range | Locality | Plot | Subplot | Date | Glu_P | |
|---|---|---|---|---|---|---|---|
| 377 | Chile | Central Andes | Baños de Colinas | 76 | 2 | 2014-01-21 | NA |
| 378 | Chile | Central Andes | Baños de Colinas | 76 | 3 | 2014-01-21 | NA |
To understand the distribution of our data in the data set we
use the following graph:
There is a predominance of continuous columns over discrete ones, so our analyses will use both discrete and continuous features. We can also notice that missing observations represent only 0.014% of the total number of values, which at first sight makes this a good data set.
We take our
search for anomalies further by exploring the characteristics of each
variable. For the readability of the report, we show only a few
variables.
## Country
## n missing distinct
## 430 0 2
##
## Value Chile Spain
## Frequency 100 330
## Proportion 0.233 0.767
## Mountain_range
## n missing distinct
## 430 0 3
##
## Value Central Andes Central Pyrenees Sierra de Guadarrama
## Frequency 100 135 195
## Proportion 0.233 0.314 0.453
## Phos_P
## n missing distinct Info Mean Gmd .05 .10
## 430 0 268 1 3.477 2.26 0.6777 1.0753
## .25 .50 .75 .90 .95
## 1.9318 3.1503 4.7146 6.2213 7.1411
##
## lowest : 0.01980997 0.16225689 0.25041658 0.31268048 0.32430518
## highest: 7.73917664 7.90167428 8.05414937 8.64896403 8.64973041
## Glu_P
## n missing distinct Info Mean Gmd .05 .10
## 428 2 272 1 2.102 1.24 0.2986 0.5062
## .25 .50 .75 .90 .95
## 1.3679 2.0922 2.7710 3.2561 3.9541
##
## lowest : 0.1074305 0.1110000 0.1140850 0.1150734 0.1619782
## highest: 4.5538748 4.7529319 5.2224173 5.5441612 6.3505287
## NT_P
## n missing distinct Info Mean Gmd .05 .10
## 430 0 273 1 3.971 2.688 0.4501 0.7760
## .25 .50 .75 .90 .95
## 2.1575 3.9155 5.6767 7.2597 8.0787
##
## lowest : 0.1156901 0.1610857 0.1833210 0.2201596 0.2365415
## highest: 8.4518587 8.9110000 9.0285000 10.8160000 18.0010000
We can observe that there is a big difference between the total
number of observations and the number of distinct observations for the
variables related to the chemical elements. We will try to understand
where this difference comes from in the visual analysis part.
We plot the numerical variables.
Many variables appear with a distribution that looks like the
log-normal distribution.
We will then use box-plots to detect outliers in the numerical variables and compare the distributions across the mountain classes. We can see real differences between the mountains.
We then plot the categorical variables:
We see that more observations come from Spain, which is normal
since two out of three mountains are located in Spain. The localities
where the samples were taken are almost all composed of a sample of 5
subsamples. Some localities, perhaps more interesting for the study,
were sampled several times, but always by a multiple of 5 subsamples.
We have more observations for the mountain “Sierra de Guadarrama” (195) than for “Central Andes” (100) and “Central Pyrenees” (135). The differences between the numbers of observations are large enough that we should be careful with the results and consider balancing the data if needed.
We can comment on the number of observations per class and its effect on accuracy. We may also need to pay particular attention to sensitivity and specificity, since there is twice as much information on Sierra de Guadarrama as on the other classes.
As described above with the summary output of the data, we have more information on the mountain “Sierra de Guadarrama”: twice as much as on “Central Andes”. Our final result might be adversely affected, because the model will tend to achieve a good accuracy by predicting “Sierra de Guadarrama” more often, yet it will not generalize well to new instances. We will have to see whether we need to balance our data to get a better model.
We will also inspect possible duplicate observations; indeed, as previously found, some variables have fewer distinct values than observations.
We immediately notice the poverty of the data concerning the samples of Sierra de Guadarrama: removing duplicates with the unique function leaves us with a data set of only 274 observations. We will therefore first implement our models while keeping the duplicates, knowing that identical values in the train set and the test set will inflate the measured accuracy of the model. Then we will test our models again with the reduced data set to observe whether there is a loss of accuracy.
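The duplicate check itself can be sketched with base R (a toy data frame stands in for our cleaned data; rows 1 and 2 repeat the same measurements, as the Sierra de Guadarrama subplots do):

```r
# Count repeated rows, then keep only the first occurrence of each.
toy <- data.frame(
  Mountain_range = c("Sierra de Guadarrama", "Sierra de Guadarrama",
                     "Central Andes"),
  pH_P  = c(4.925, 4.925, 5.1),
  SOC_P = c(6.316, 6.316, 4.2)
)
sum(duplicated(toy))       # number of repeated rows
toy_unique <- unique(toy)  # de-duplicated data set
nrow(toy_unique)
```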
For the rest of the EDA we will continue the
analysis on the complete data set.
From the correlation plot, it seems that some patterns can be observed. The variables concerning the phosphatase enzyme seem to be positively correlated with the variables about soil organic carbon.
With this plot, we see indeed that the families of soil organic carbon and phosphatase enzyme variables are significantly positively correlated, with correlation coefficients ranging from 0.739 (SOC_B - Phos_P) to 0.947 (SOC_T - SOC_P).
This analysis helps us to understand the link between the explanatory variables.
The first step is to analyse the data in the covariance matrix, as we did before, where we found the positive correlation between soil organic carbon and the phosphatase enzyme. The second step is to group the data into principal components. The third step is to produce a variable factor map to better understand the role of each factor.
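These steps can be sketched with prcomp (here on the iris measurements as stand-in numeric data; in our script the input is the numeric soil columns, with NAs dropped via na.omit and variables standardized with scale. = TRUE):

```r
# PCA on standardized numeric columns; with scale. = TRUE each
# variable contributes variance 1, so the eigenvalues sum to the
# number of variables.
num_data <- na.omit(iris[, 1:4])
pca <- prcomp(num_data, scale. = TRUE)
summary(pca)           # standard deviations and cumulative variance
head(pca$rotation, 2)  # loadings of the first variables
```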
## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6 PC7
## Standard deviation 3.3080 2.2090 1.6387 1.26851 1.02032 0.97230 0.82856
## Proportion of Variance 0.4377 0.1952 0.1074 0.06436 0.04164 0.03781 0.02746
## Cumulative Proportion 0.4377 0.6329 0.7403 0.80469 0.84633 0.88414 0.91161
## PC8 PC9 PC10 PC11 PC12 PC13 PC14
## Standard deviation 0.7106 0.63948 0.58236 0.5050 0.41430 0.37199 0.32284
## Proportion of Variance 0.0202 0.01636 0.01357 0.0102 0.00687 0.00554 0.00417
## Cumulative Proportion 0.9318 0.94816 0.96173 0.9719 0.97879 0.98433 0.98850
## PC15 PC16 PC17 PC18 PC19 PC20 PC21
## Standard deviation 0.25571 0.23393 0.21525 0.17576 0.17150 0.1415 0.12739
## Proportion of Variance 0.00262 0.00219 0.00185 0.00124 0.00118 0.0008 0.00065
## Cumulative Proportion 0.99111 0.99330 0.99515 0.99639 0.99757 0.9984 0.99902
## PC22 PC23 PC24 PC25
## Standard deviation 0.1115 0.08012 0.05935 0.04711
## Proportion of Variance 0.0005 0.00026 0.00014 0.00009
## Cumulative Proportion 0.9995 0.99977 0.99991 1.00000
Here, since the command prcomp does not allow NAs in the data, we use na.omit on our reduced numerical data to remove all cases with NAs from the data frame.
For the further analysis, we can also study the eigenvalues in order to select a good number of components.
## eigenvalue variance.percent cumulative.variance.percent
## Dim.1 10.936952114 43.747808455 43.74781
## Dim.2 4.863450112 19.453800446 63.20161
## Dim.3 2.694400932 10.777603728 73.97921
## Dim.4 1.602728782 6.410915127 80.39013
## Dim.5 1.045432762 4.181731049 84.57186
## Dim.6 0.954822102 3.819288408 88.39115
## Dim.7 0.688518236 2.754072944 91.14522
## Dim.8 0.503178240 2.012712959 93.15793
## Dim.9 0.411065663 1.644262652 94.80220
## Dim.10 0.336691388 1.346765554 96.14896
## Dim.11 0.254731126 1.018924506 97.16789
## Dim.12 0.172594714 0.690378855 97.85826
## Dim.13 0.138243331 0.552973325 98.41124
## Dim.14 0.105400519 0.421602078 98.83284
## Dim.15 0.067346066 0.269384264 99.10222
## Dim.16 0.054954619 0.219818475 99.32204
## Dim.17 0.046558386 0.186233542 99.50828
## Dim.18 0.031588992 0.126355968 99.63463
## Dim.19 0.029601829 0.118407318 99.75304
## Dim.20 0.020191077 0.080764308 99.83380
## Dim.21 0.016267999 0.065071996 99.89888
## Dim.22 0.012623924 0.050495696 99.94937
## Dim.23 0.006886114 0.027544454 99.97692
## Dim.24 0.003535893 0.014143571 99.99106
## Dim.25 0.002235080 0.008940321 100.00000
We obtain the cumulative variance, as before, and also the eigenvalues. Therefore, we can consider dimensions 1 to 5, whose eigenvalues are greater than 1:
The variable factor map shows the variables organized along the principal dimensions; here the first two dimensions are represented.

- Dimension 1 (x-axis): highly correlated to Phos_T, Phos_B, Phos_P and Glu_T; moderately correlated to PT_B; poorly correlated to Cond_T and Cond_B.
- Dimension 2: well correlated to Cond_T and Cond_B; also moderately negatively correlated to Radiation.

It seems that we have 4 groups of variables playing different roles. On these two dimensions, we notice that the mountain classes already separate into 3 distinct clusters.
- Dim 1: highly correlated to the Phos, SOC and Glu variables
- Dim 2: correlated with Cond_P and Cond_T
- Dim 3: correlated with PT_P, PT_B and PT_T
- Dim 4: moderately correlated to K_B and K_T
- Dim 5: correlated to Radiation
The square cosine shows the importance of a component for a given observation; it is therefore normal that observations close to the origin are less significant than those far from it. Here we decided to represent only one variable of each type, since the same chemical elements tend to have the same behavior independently of their sampling method. A variable with interesting behavior is Radiation: the higher the dimension we select, the more important this variable becomes (except for dimension 4), while the variables related to chemical elements tend to decrease. Thus, we find Radiation strongly correlated with dimension 5.
As seen in the EDA, we can consider 5 dimensions. In the following graph we project the 3 mountains onto 3 dimensions; clusters may be apparent.
Through this 3D representation, we can observe the distribution of the 3 mountains in the PCA space. Further in the analysis we will perform a cluster analysis to better understand the apparent separation between the mountains.
Before starting the analysis, we know that some of the models we use do not work with NAs. To deal with them, we decided to replace the 2 NAs by the mean of their closest observations.
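A minimal sketch of this imputation rule (the vector and values are illustrative, not the actual Glu_P column):

```r
# Replace each NA by the mean of the nearest non-missing values
# before and after it in the ordered data. (Sketch only: assumes no
# NA sits at the very start or end of the vector.)
glu <- c(2.1, 1.9, NA, NA, 2.5, 2.7)
for (i in which(is.na(glu))) {
  before <- glu[seq_len(i - 1)]
  after  <- glu[seq(i + 1, length(glu))]
  prev_val <- tail(before[!is.na(before)], 1)
  next_val <- head(after[!is.na(after)], 1)
  glu[i] <- mean(c(prev_val, next_val))
}
glu  # both NAs filled with the mean of their nearest neighbors
```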
The next step is to split our data set into a training set (Mountain_data.tr_notsubs) and a test set (Mountain_data.te). The training set has 75% of the observations and the test set has the remaining 25%.
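The split can be sketched as follows (iris stands in for our cleaned data; the seed is illustrative):

```r
# 75/25 train/test split by sampling row indices without replacement.
set.seed(123)  # for reproducibility
n <- nrow(iris)
train_idx <- sample(n, size = round(0.75 * n))
train <- iris[train_idx, ]   # ~75% of observations
test  <- iris[-train_idx, ]  # remaining ~25%
c(nrow(train), nrow(test))
```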
As discussed in the EDA part, we should balance our data because we do not have the same amount of information on each mountain. We have the most observations on Sierra de Guadarrama and only half as many on Central Andes.
We balanced the data on the training set Mountain_data.tr_notsubs.
##
## Central Andes Central Pyrenees Sierra de Guadarrama
## 79 79 79
Now we see that all three mountains have the same number of observations. We decided to sub-sample the observations because we think that duplicating them would create more bias, given that we do not have that many observations.
Our new training set is Mountain_data.tr.
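The down-sampling step can be sketched as follows (class sizes are illustrative):

```r
# Down-sample: reduce every class to the size of the smallest class
# by sampling rows without replacement.
set.seed(1)
df <- data.frame(class = rep(c("A", "B", "C"), times = c(150, 100, 79)))
n_min <- min(table(df$class))
balanced <- do.call(rbind, lapply(split(df, df$class), function(d) {
  d[sample(nrow(d), n_min), , drop = FALSE]
}))
table(balanced$class)  # every class now has n_min observations
```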
We start by fitting a Neural Network model on our balanced training set (Mountain_data.tr). In order to choose the parameters of this neural network, we applied a simple hyperparameter tuning.
The best Neural Network parameters are 4 hidden units with a decay of 0.1, since this is the combination that gives the highest accuracy.
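A sketch of the tuned fit with the nnet package (size = number of hidden units, decay = weight decay; iris stands in for our training set, and the seed is illustrative):

```r
# Single-hidden-layer network with the tuned parameters.
library(nnet)
set.seed(42)
fit <- nnet(Species ~ ., data = iris, size = 4, decay = 0.1,
            maxit = 100, trace = FALSE)
pred <- predict(fit, iris, type = "class")
mean(pred == iris$Species)  # training accuracy
```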
The manually written Neural Network model
## # weights: 747
## initial value 343.739521
## iter 10 value 260.416957
## iter 20 value 260.320004
## iter 30 value 249.074172
## iter 40 value 200.294799
## iter 50 value 167.751554
## iter 60 value 147.603498
## iter 70 value 118.282740
## iter 80 value 115.248739
## iter 90 value 16.684235
## iter 100 value 15.105342
## final value 15.105342
## stopped after 100 iterations
## Confusion Matrix and Statistics
##
## Reference
## Prediction Central Andes Central Pyrenees Sierra de Guadarrama
## Central Andes 21 0 0
## Central Pyrenees 0 29 0
## Sierra de Guadarrama 0 0 58
##
## Overall Statistics
##
## Accuracy : 1
## 95% CI : (0.9664, 1)
## No Information Rate : 0.537
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 1
##
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: Central Andes Class: Central Pyrenees
## Sensitivity 1.0000 1.0000
## Specificity 1.0000 1.0000
## Pos Pred Value 1.0000 1.0000
## Neg Pred Value 1.0000 1.0000
## Prevalence 0.1944 0.2685
## Detection Rate 0.1944 0.2685
## Detection Prevalence 0.1944 0.2685
## Balanced Accuracy 1.0000 1.0000
## Class: Sierra de Guadarrama
## Sensitivity 1.000
## Specificity 1.000
## Pos Pred Value 1.000
## Neg Pred Value 1.000
## Prevalence 0.537
## Detection Rate 0.537
## Detection Prevalence 0.537
## Balanced Accuracy 1.000
With the confusionMatrix command, we get more information on the model. As said before, the accuracy is perfect (100%), and the balanced accuracy of 1 is the maximum we can get, which means that our model does not suffer from unbalanced data.
The analysis of the “Naive Bayes” model comes back to the analysis of the density graphs of each variable by mountain range. These density graphs confirm what has been seen before: the pH is a very good feature to classify the mountains. Indeed, pH_T, pH_B and pH_P are intermediate in Central Andes, lower in Sierra de Guadarrama and higher in Central Pyrenees.
In addition, observing other variables such as the phosphatase enzyme (Phos_P, Phos_B, Phos_T), the β-glucosidase enzyme (Glu_B, Glu_P, Glu_T), soil organic carbon (SOC_T, SOC_B, SOC_P), soil total nitrogen (NT_P, NT_B, NT_T), electrical conductivity (Cond_P, Cond_B, Cond_T) and the radiation can give us an idea of the mountain we are on. Indeed, a radiation lower than 0.6 rather indicates the Central Pyrenees. The electrical conductivity allows us to distinguish the Central Andes from the Central Pyrenees, since it is higher for the latter. The soil total nitrogen is lower for the Central Andes than for the two others. The soil organic carbon allows us to distinguish the Central Andes from the Sierra de Guadarrama, since it is higher for the latter. The β-glucosidase enzyme is present at approximately the same level in the Central Andes and Central Pyrenees but reaches higher values in the Sierra de Guadarrama, and the same holds for the phosphatase enzyme.
However, some variables do not allow us to determine the mountain at all. This is the case for phosphorus (PT_P, PT_B, PT_T) and potassium (K_P, K_B, K_T).
## Confusion Matrix and Statistics
##
## Reference
## Prediction Central Andes Central Pyrenees Sierra de Guadarrama
## Central Andes 17 0 0
## Central Pyrenees 3 29 0
## Sierra de Guadarrama 1 0 58
##
## Overall Statistics
##
## Accuracy : 0.963
## 95% CI : (0.9079, 0.9898)
## No Information Rate : 0.537
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.9379
##
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: Central Andes Class: Central Pyrenees
## Sensitivity 0.8095 1.0000
## Specificity 1.0000 0.9620
## Pos Pred Value 1.0000 0.9063
## Neg Pred Value 0.9560 1.0000
## Prevalence 0.1944 0.2685
## Detection Rate 0.1574 0.2685
## Detection Prevalence 0.1574 0.2963
## Balanced Accuracy 0.9048 0.9810
## Class: Sierra de Guadarrama
## Sensitivity 1.0000
## Specificity 0.9800
## Pos Pred Value 0.9831
## Neg Pred Value 1.0000
## Prevalence 0.5370
## Detection Rate 0.5370
## Detection Prevalence 0.5463
## Balanced Accuracy 0.9900
Through this confusion matrix, we can see that the Naive Bayes accuracy is 96.3% and the balanced accuracy for Central Andes is 90.48%. It is indeed lower than for the Neural Network, but it remains good. Four Central Andes observations were mispredicted: 3 as Central Pyrenees and 1 as Sierra de Guadarrama.
We use a 2-NN classifier to predict the test set using the training set.
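A sketch of this step with class::knn (iris stands in for the soil variables; standardizing first matters because kNN is distance-based, and the seed is illustrative):

```r
# 2-nearest-neighbor classification on standardized features.
library(class)
set.seed(7)
x <- scale(iris[, 1:4])
idx <- sample(nrow(iris), 110)            # training rows
pred <- knn(train = x[idx, ], test = x[-idx, ],
            cl = iris$Species[idx], k = 2)
mean(pred == iris$Species[-idx])          # test-set accuracy
```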
## Confusion Matrix and Statistics
##
## Reference
## Prediction Central Andes Central Pyrenees Sierra de Guadarrama
## Central Andes 21 0 0
## Central Pyrenees 0 29 0
## Sierra de Guadarrama 0 0 58
##
## Overall Statistics
##
## Accuracy : 1
## 95% CI : (0.9664, 1)
## No Information Rate : 0.537
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 1
##
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: Central Andes Class: Central Pyrenees
## Sensitivity 1.0000 1.0000
## Specificity 1.0000 1.0000
## Pos Pred Value 1.0000 1.0000
## Neg Pred Value 1.0000 1.0000
## Prevalence 0.1944 0.2685
## Detection Rate 0.1944 0.2685
## Detection Prevalence 0.1944 0.2685
## Balanced Accuracy 1.0000 1.0000
## Class: Sierra de Guadarrama
## Sensitivity 1.000
## Specificity 1.000
## Pos Pred Value 1.000
## Neg Pred Value 1.000
## Prevalence 0.537
## Detection Rate 0.537
## Detection Prevalence 0.537
## Balanced Accuracy 1.000
With this KNN supervised method, the prediction made on the test set gives us an accuracy of 1. It looks perfect and attractive, but it is completely biased by the fact that subplots within the same plot can carry identical values. Observations therefore overlap, and the computed distance between them is 0. What we need to do is build a de-duplicated data set with no overlapping observations.
We now want to replicate the above models with another training set and test set obtained through a bootstrapping method. To do so, we first delete all the repeated observations from the original cleaned dataset.
We remove every replicated row with the “unique” function. We do so to see whether we get more realistic results: as seen above, the accuracy of our models is very high, which is normally unlikely, so we are testing whether this high accuracy is due to duplicated observations in our dataset. (The scientists did take separate samples, but many of them turn out to be exactly the same, perhaps differing only at a very small number of decimal places.)
We have seen that the new unique data set is largely unbalanced. The mountain that had the most observations is now the one with the fewest, and therefore the least information.
Sierra de Guadarrama, after removing every replicated row with the “unique” function, only has 39 observations. This is indeed unbalanced, and insufficient for accurate classification.
Now that we have a new dataset, we have to split it again into a training set (Mountain_df.tr_notsubs) and a test set (Mountain_df.te).
Once the duplicates are removed, we are left with a data set that is low on data, especially for the Sierra de Guadarrama class. Given this scarcity, we want to test a random forest model. Indeed, the bagging should help us overcome the lack of data and give the best possible model. Another advantage of the random forest model is that it makes it easy to identify the importance of variables.
Here we can see the difference between the two importance measures: the mean decrease in accuracy that each variable contributes to the model, and the mean decrease in Gini, a coefficient that measures each variable's contribution to the homogeneity of nodes and leaves. We notice that some variables have a higher accuracy-based importance than Gini-based importance; for our selection of candidate variables for a simpler model, we will make a compromise between these two measures.
From the variable importance we see that the variables pH_P, pH_T and pH_B are very important. We can then say that the pH in general is important.
## predtr
## Central Andes Central Pyrenees Sierra de Guadarrama
## Central Andes 69 0 0
## Central Pyrenees 0 107 0
## Sierra de Guadarrama 0 0 29
## Accuracy of the train set = 1
The first table is the confusion matrix of the Random Forest model on the training set; it is therefore normal to have an accuracy of 1, since the model already knows these observations.
## predte
## Central Andes Central Pyrenees Sierra de Guadarrama
## Central Andes 31 0 0
## Central Pyrenees 0 28 0
## Sierra de Guadarrama 0 0 10
## Accuracy of the test set = 1
The second table is the confusion matrix of the random forest model on the test set. We still have an accuracy of 1, which makes it a good model. We have to keep in mind that the data set is very small and not necessarily representative of a larger sample.
Currently the model has many variables, and we notice that many of them are not important for mountain classification. We therefore test a model with the 5 main variables to avoid overloading the model. Having fewer variables would allow us to predict a larger set of data with less computational power and fewer resources. Moreover, time could also be saved during sampling, since fewer chemical measurements would be needed to predict the class.
## predtr
## Central Andes Central Pyrenees Sierra de Guadarrama
## Central Andes 69 0 0
## Central Pyrenees 0 107 0
## Sierra de Guadarrama 0 0 29
## Accuracy of the train set = 1
## predte
## Central Andes Central Pyrenees Sierra de Guadarrama
## Central Andes 31 0 0
## Central Pyrenees 1 27 0
## Sierra de Guadarrama 0 0 10
## Accuracy of the test set = 0.9855072
Even with this model composed of fewer variables, we still see a good accuracy on the test set compared to the train set. Here the error comes from the fact that the model tends to predict Central Andes more often, since we cannot balance the data between the classes without reducing the data set too much.
Overall the random forest model seems to be a good model that can be used with fewer variables while keeping a high accuracy.
We now bootstrap to re-test some of our models and check whether the accuracy changes once the duplicate data is removed.
We can now proceed to the bootstrapping with 100 replicates.
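The resampling itself can be sketched in base R (the sample size here is illustrative, not the actual size of our de-duplicated training set):

```r
# 100 bootstrap replicates: each is a sample of n row indices drawn
# with replacement from the n original rows.
set.seed(99)
n <- 50   # stand-in for the number of de-duplicated training rows
B <- 100
boot_idx <- replicate(B, sample(n, n, replace = TRUE))
dim(boot_idx)  # n rows x B bootstrap replicates
# Each replicate contains roughly 63% of the distinct original rows
# on average (the classic bootstrap property):
mean(apply(boot_idx, 2, function(i) length(unique(i)) / n))
```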
We can now replicate the models.
We apply the same simple hyperparameter tuning; this code takes time to run.
The best neural network should again have 4 hidden units and a decay of 0.1, the same as before.
## # weights: 375
## initial value 284.692893
## iter 10 value 116.079753
## iter 20 value 57.538311
## iter 30 value 29.136661
## iter 40 value 22.652969
## iter 50 value 21.197795
## iter 60 value 20.339347
## iter 70 value 18.971778
## iter 80 value 18.386184
## iter 90 value 18.351844
## iter 100 value 18.341811
## final value 18.341811
## stopped after 100 iterations
Now that we have fitted the model and made the predictions, we can use the confusion matrix to see whether the model performed well.
## Confusion Matrix and Statistics
##
## Reference
## Prediction Central Andes Central Pyrenees Sierra de Guadarrama
## Central Andes 24 0 0
## Central Pyrenees 0 39 0
## Sierra de Guadarrama 0 0 10
##
## Overall Statistics
##
## Accuracy : 1
## 95% CI : (0.9507, 1)
## No Information Rate : 0.5342
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 1
##
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: Central Andes Class: Central Pyrenees
## Sensitivity 1.0000 1.0000
## Specificity 1.0000 1.0000
## Pos Pred Value 1.0000 1.0000
## Neg Pred Value 1.0000 1.0000
## Prevalence 0.3288 0.5342
## Detection Rate 0.3288 0.5342
## Detection Prevalence 0.3288 0.5342
## Balanced Accuracy 1.0000 1.0000
## Class: Sierra de Guadarrama
## Sensitivity 1.000
## Specificity 1.000
## Pos Pred Value 1.000
## Neg Pred Value 1.000
## Prevalence 0.137
## Detection Rate 0.137
## Detection Prevalence 0.137
## Balanced Accuracy 1.000
## Confusion Matrix and Statistics
##
## Reference
## Prediction Central Andes Central Pyrenees Sierra de Guadarrama
## Central Andes 24 0 0
## Central Pyrenees 0 39 3
## Sierra de Guadarrama 0 0 7
##
## Overall Statistics
##
## Accuracy : 0.9589
## 95% CI : (0.8846, 0.9914)
## No Information Rate : 0.5342
## P-Value [Acc > NIR] : 5.772e-16
##
## Kappa : 0.9281
##
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: Central Andes Class: Central Pyrenees
## Sensitivity 1.0000 1.0000
## Specificity 1.0000 0.9118
## Pos Pred Value 1.0000 0.9286
## Neg Pred Value 1.0000 1.0000
## Prevalence 0.3288 0.5342
## Detection Rate 0.3288 0.5342
## Detection Prevalence 0.3288 0.5753
## Balanced Accuracy 1.0000 0.9559
## Class: Sierra de Guadarrama
## Sensitivity 0.70000
## Specificity 1.00000
## Pos Pred Value 1.00000
## Neg Pred Value 0.95455
## Prevalence 0.13699
## Detection Rate 0.09589
## Detection Prevalence 0.09589
## Balanced Accuracy 0.85000
Now, the accuracy is 95.89%, lower than before, but perhaps more realistic given the unique data set and the bootstrap. The balanced accuracy remains 1 for Central Andes, but is 0.9559 for Central Pyrenees and drops to 0.85 for Sierra de Guadarrama.
Using the dendrogram with complete linkage on Manhattan distance, together with the silhouette method, we identified k=5 as the optimal number of clusters.
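This clustering pipeline can be sketched as follows (iris measurements stand in for the scaled soil variables; the cluster package provides silhouette()):

```r
# Hierarchical clustering: Manhattan distance, complete linkage,
# cut at k = 5; the average silhouette width is the criterion used
# to pick k.
library(cluster)
x  <- scale(iris[, 1:4])
d  <- dist(x, method = "manhattan")
hc <- hclust(d, method = "complete")
k5 <- cutree(hc, k = 5)            # cluster label per observation
sil <- silhouette(k5, d)
summary(sil)$avg.width             # average silhouette over all clusters
```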
By analyzing these clusters, we can differentiate some of them. Cluster 1 seems to be roughly average among the clusters for all variables, except for its pH (pH_B, pH_P and pH_T), which is well below the others. Cluster 2 is distinguished by its high PT_P, PT_T and PT_B. Cluster 3 is characterized by a high pH_B, pH_P and pH_T and a low β-glucosidase enzyme content (Glu_B, Glu_P and Glu_T), like cluster 5. Looking at the distribution of cluster 4, we see that it contains few observations but is well distinguished from the others; indeed, only one line appears where the median merges with the width of its distribution. This cluster has the lowest radiation, the highest soil organic carbon (SOC_B, SOC_T and SOC_P) and the highest soil total nitrogen (NT_T, NT_B and NT_P). Finally, the last cluster has a pH (pH_B, pH_P and pH_T) close to zero and a high radiation, and it contains the observations with the lowest soil organic carbon (SOC_B, SOC_T and SOC_P).

On the silhouette plot, we see that the average silhouette over all five clusters is 0.45. Clusters 1 and 4 are the most homogeneous and have the highest silhouettes, respectively 0.56 and 0.70. Cluster 5 has some mountains badly clustered, as it shows negative silhouettes, and it has the lowest silhouette (0.15).
To conclude, we perform a final score analysis of each model, so that we can select the best one.
To recap, we studied the following models on a balanced data set, using data splitting to avoid over-fitting:
Then we refitted the models on the de-duplicated ("unique") data set, to avoid the overlapping observations.
To conclude about variable importance, the most recurrent variable, the one that really impacts the classification and makes a difference, is the pH: pH_B, pH_T and pH_P. Without this soil characteristic, the accuracy would not be as good.
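The kind of claim made above can be checked with permutation importance: permute one variable at a time and measure how much the model's accuracy drops. A hedged sketch with sklearn on synthetic data, where one "pH-like" feature drives the class and two noise features do not:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance

# Synthetic stand-in: the class is driven entirely by the first feature
# (a pH-like variable); the other two columns are uninformative noise.
rng = np.random.default_rng(1)
n = 300
ph = rng.normal(size=n)
noise = rng.normal(size=(n, 2))
X = np.column_stack([ph, noise])
y = (ph > 0).astype(int)

rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
imp = permutation_importance(rf, X, y, n_repeats=10, random_state=0)
print(imp.importances_mean.argmax())  # 0: the pH-like feature dominates
```

In R, `caret::varImp()` or `randomForest::importance()` give the analogous ranking that identifies pH_B, pH_T and pH_P as the dominant variables.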
To sum up, all the models have an accuracy above 90%. The neural network performs very well, both before and after de-duplication, and it can handle all the variables. However, we have to bear in mind the small size of the data set and the duplicated observations. Overall, the random forest also seems to be a good model, one that can be used with fewer variables while keeping a high accuracy.
The main limitations come from the data set used to train and test the models. Indeed, as we noticed during the EDA, there are many duplicated observations for the Sierra de Guadarrama region; when not removed, they inflate the accuracy on both the test set and the train set, since predicting data identical to the training data trivially gives good results. Moreover, even after removing these duplicates, the fact that the sub-samples were taken close to each other prevents the results from being representative of a real study conducted with samples taken from more widely spaced areas. We therefore obtain a misleadingly high accuracy, since the values to predict are close to those used to train the model. Ideally, we should keep only one sub-sample per sample, but in that case we would sorely lack data to train the models.
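One standard remedy for this kind of leakage, short of discarding sub-samples, is a group-aware split: all sub-samples of the same sample go to the same side of the train/test boundary. A minimal sketch with sklearn's GroupKFold on synthetic data (the group labels stand in for the sample each sub-sample was taken from):

```python
import numpy as np
from sklearn.model_selection import GroupKFold

# Synthetic stand-in: 4 samples, each measured as 3 nearby sub-samples.
rng = np.random.default_rng(2)
X = rng.normal(size=(12, 3))
y = rng.integers(0, 3, size=12)
groups = np.repeat(np.arange(4), 3)  # sample id of each sub-sample

# GroupKFold guarantees a sample's sub-samples never straddle the split,
# so the test fold contains no near-duplicates of the training data.
for train_idx, test_idx in GroupKFold(n_splits=4).split(X, y, groups):
    assert set(groups[train_idx]).isdisjoint(set(groups[test_idx]))
print("no sample is split across train and test")
```

The same idea exists in R via `caret::groupKFold()`. Accuracy measured this way would be lower, but closer to what the models would achieve on genuinely new sampling sites.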
In the analysis, we discovered a strong positive correlation between soil organic carbon and phosphatase enzyme activity. For future analyses, it would be interesting to study the features and characteristics that influence these two variables.
We found an interesting article describing the effects of phosphate rock and organic inputs on soil organic carbon and acid phosphatase activity.
Title: Soil Organic Carbon and Acid Phosphatase Enzyme Activity Response to Phosphate Rock and Organic Inputs in Acidic Soils of Central Highlands of Kenya in Maize: https://ir-library.ku.ac.ke/bitstream/handle/123456789/22271/Soil%20Organic%20Carbon%20and%20Acid%20Phosphatase%20Enzyme....pdf?sequence=1
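For reference, the correlation mentioned above can be quantified with the Pearson coefficient. A minimal sketch in numpy, with short illustrative arrays standing in for the SOC and phosphatase measurements (the real values come from the article's data set):

```python
import numpy as np

# Illustrative stand-ins for soil organic carbon and phosphatase activity;
# the two series co-vary strongly, as observed in our analysis.
soc = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
phos = np.array([1.1, 2.1, 2.9, 4.2, 4.8])

r = np.corrcoef(soc, phos)[0, 1]  # Pearson correlation coefficient
print(round(r, 3))
```

In R this is simply `cor(soc, phos)`, the same statistic used in the correlation matrix of our EDA.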
With more time, an interesting study would be to check on a map whether the samples were taken close to a watercourse, encoding this information as a binary variable. We could then analyze whether there is a correlation between the chemical elements of the soil and their proximity to a water point, independently of the region and the climate. Depending on the result of this analysis, we could refine the description of a sample's quality in terms of its ability to predict a class. The operation could be repeated with samples taken close to a forest, as well as with the sampling altitude.